MCIC Wooster, OSU
2024-02-01
The transcriptome is the full set of transcripts expressed by an organism, which:
Is not at all stable across time & space in any given organism
(unlike the genome but much like the proteome)
Varies qualitatively (which transcripts are expressed) and, especially, quantitatively (how much of each transcript is expressed)
Transcriptomics is the study of the transcriptome,
i.e. the large-scale study of RNA transcripts expressed in an organism.
Many approaches & applications — but most commonly, transcriptomics focuses on:
mRNA rather than noncoding RNA types such as rRNA, tRNA, and miRNA
Quantifying gene expression levels (& ignoring nucleotide-level variation)
Statistically comparing expression between groups (treatments, populations, tissues)
https://hbctraining.github.io
Considering…
That protein production gives clues about the activity of specific biological functions, and the molecular mechanisms underlying those functions;
That it is much easier to measure transcript expression than protein expression at scale;
The central dogma
… we can use gene expression levels as a proxy for protein expression levels and make functional inferences.
Specifically, we can use transcriptomics to:
Find the pathways and genes that:
Underlie phenotypic responses
Explain differences between groups (treatments, genotypes, sexes, tissues, etc.)
Can be targeted to enhance or reduce responses for pathogen and pest control
RNA-seq is the current state-of-the-art family of methods to study the transcriptome.
It involves the random sequencing of millions of transcript fragments per sample.
We will focus on the most common type of RNA-seq, which:
Does not actually sequence the RNA, but first reverse transcribes RNA to cDNA
Attempts to sequence only mRNA while avoiding noncoding RNAs (“mRNA-seq”)
Does not distinguish between RNA from different cell types (“bulk RNA-seq”)
Uses short reads (≤150 bp) that do not cover full transcripts but do uniquely ID genes
RNA-seq data can also be used for applications other than expression quantification:
For organisms without a reference genome: identify genes present in the organism
For organisms with a reference genome: discover new genes & transcripts,
and improve genome annotation
All in all, RNA-seq is a very widely used technique —
it constitutes the most common usage of high-throughput sequencing!
RNA-seq is also the most common data type I assist with as an MCIC bioinformatician. Some projects I’ve worked on used it to identify genes & pathways that differ between:
Multiple soybean cultivars in response to Phytophthora sojae inoculation; soybean in response to different Phytophthora species and strains (Dorrance lab, PlantPath)
Wheat vs. Xanthomonas with a gene knock-out vs. knock-in (Jacobs lab, PlantPath)
Mated and unmated mosquitoes (Sirot lab, The College of Wooster)
Tissues of the ambrosia beetle and its symbiotic fungus (Ranger lab, USDA Wooster)
Diapause-inducing conditions for two pest stink bug species (Michel lab, Entomology)
Human carcinoma cell lines with vs. without a manipulated gene (Cruz lab, CCC)
Pig coronaviruses with vs. without an experimental insertion (Wang lab, CFAH)
And to improve the annotation of a nematode genome (Taylor lab, PlantPath)
RNA-seq typically compares groups of samples defined by differences in:
Treatments (e.g. different host plant, temperature, diet, mated/unmated) and/or
Organismal variants: ages/developmental stages, sexes, or genotypes (lines/biotypes/subspecies/morphs) and/or
Tissues
https://github.com/ScienceParkStudyGroup/rnaseq-lesson
To be able to make statistically supported conclusions about expression differences between such groups of samples, we must have biological replication.
When designing an RNA-seq experiment, keep the following in mind:
Technical replicates?
You won’t need technical replicates that only replicate library prep and/or sequencing, but, depending on your experimental design, you may want to technically replicate something else.
https://sydney-informatics-hub.github.io/training-RNAseq-slides
Next, “library preparation” is the series of lab steps to produce a collection of molecules ready to be sequenced
Library prep is typically done by sequencing facilities
Fig. from Kukurba & Montgomery 2015 (www.ncbi.nlm.nih.gov/pmc/articles/PMC4863231/)
Sequencing technology
Illumina short reads: by far the most common
PacBio or ONT long reads: consider if sequencing full transcripts (isoforms) is key
Single-end vs. paired-end reads (for Illumina)
Sequencing “depth” / amount — how many reads per sample
From Liu et al. 2014
Modified after Kukurba & Montgomery 2015
You will typically receive a “demultiplexed” (split by sample) set of FASTQ files.
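A FASTQ file stores one read per four lines: a header, the sequence, a “+” separator, and a per-base quality string. A minimal parsing sketch in Python (the example read is made up; real data would come from your demultiplexed files):

```python
# Minimal sketch of reading FASTQ records (4 lines per read:
# header, sequence, "+" separator, per-base quality string).
# The example read below is hypothetical.

def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                      # the "+" separator line
        qual = next(it).strip()
        yield header.strip().lstrip("@"), seq, qual

def mean_phred(qual):
    """Mean Phred quality; Illumina encodes Q as ASCII code minus 33."""
    return sum(ord(c) - 33 for c in qual) / len(qual)

example = ["@read1", "ACGTACGT", "+", "IIIIIIII"]   # "I" = Phred 40
for read_id, seq, qual in parse_fastq(example):
    print(read_id, len(seq), mean_phred(qual))
```

In practice you would never parse FASTQ by hand; QC tools like FastQC do this, but the sketch shows what those tools are reading.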
Once you receive your data, the first series of analysis steps involves going from the raw reads to a count table (which will have a read count for each gene in each sample).
This part is bioinformatics-heavy, with large files, a need for substantial computing power (e.g., a supercomputer), and command-line (Unix shell) programs. It specifically involves:
Read preprocessing: QC, trimming, and optionally rRNA removal
Aligning reads to a reference genome (+ alignment QC)
Quantifying expression levels
This can be run using standardized, one-size-fits-all workflows, and is therefore (relatively) suitable to be outsourced to a company, facility, or collaborator.
Read pre-processing includes the following steps:
The alignment of reads to a reference genome needs to be “splice-aware”, since reads from mature mRNA can span exon-exon junctions (the intron is present in the genome but absent from the read).
Alternatively, you can align to the transcriptome (i.e., all mature transcripts).
Alignment QC can flag problems such as:
An abundance of pre-mRNA relative to mature mRNA
DNA contamination or poor genome assembly/annotation quality
At heart, a simple counting exercise once you have the alignments in hand.
But made more complicated by sequencing biases and multi-mapping reads.
Current best-performing tools (e.g. Salmon) do transcript-level quantification — even though this is typically followed by gene-level aggregation prior to downstream analysis.
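Gene-level aggregation simply sums a gene’s transcript-level estimates per sample, using a transcript-to-gene map (in R this is what tximport does). A plain-Python sketch with hypothetical IDs and counts:

```python
from collections import defaultdict

# Sketch: aggregate transcript-level counts to gene level (per sample)
# using a transcript-to-gene map. All IDs and numbers are hypothetical.
tx2gene = {"tx1a": "geneA", "tx1b": "geneA", "tx2": "geneB"}
tx_counts = {"tx1a": 120.0, "tx1b": 30.0, "tx2": 55.0}

def gene_level(tx_counts, tx2gene):
    """Sum transcript-level count estimates for each gene."""
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene_counts[tx2gene[tx]] += count
    return dict(gene_counts)

print(gene_level(tx_counts, tx2gene))   # geneA: 150.0, geneB: 55.0
```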
Fast-moving field
Several very commonly used tools like featureCounts (>15k citations) and HTSeq (>18k citations) have become disfavored in the past couple of years, as they, for example, don’t count multi-mapping reads at all.
The “nf-core” initiative (https://nf-co.re) attempts to produce best-practice and automated workflows/pipelines, like for RNA-seq (https://nf-co.re/rnaseq):
The second part of RNA-seq data analysis involves analyzing the count table.
In contrast to the first part, this can be done on a laptop and instead is heavier on statistics, data visualization and biological interpretation.
Common steps include:
Principal Component Analysis (PCA)
Assessing overall sample clustering patterns
Differential expression analysis
Finding genes that differ in expression level between sample groups (DEGs)
Functional enrichment analysis
See whether certain gene functions are overrepresented among DEGs
A PCA will help to visualize overall patterns of similarity among samples,
for example whether our groups of interest cluster:
Fig. 1 from Garrigos et al. 2023
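The mechanics of a PCA on a count matrix can be sketched with NumPy: log-transform, center each gene, then take an SVD. Real workflows use a variance-stabilizing transform (e.g., DESeq2’s vst) rather than a simple log2(count + 1), and the matrix below is made up:

```python
import numpy as np

# Sketch: PCA on a log-transformed count matrix (samples x genes).
# The counts are hypothetical: two groups of two samples each.
counts = np.array([
    [500, 10, 200,  3],   # sample 1, group A
    [480, 12, 190,  5],   # sample 2, group A
    [100, 80, 400, 50],   # sample 3, group B
    [110, 75, 420, 48],   # sample 4, group B
], dtype=float)

logc = np.log2(counts + 1)                    # simple stand-in for vst
centered = logc - logc.mean(axis=0)           # center each gene
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pcs = U * S                                   # sample coordinates on the PCs
var_explained = S**2 / np.sum(S**2)
print(np.round(pcs[:, :2], 2))                # PC1/PC2 per sample
print(np.round(var_explained[:2], 3))         # fraction of variance per PC
```

With these made-up counts, the two group-A samples land close together on PC1 and far from the group-B samples, which is exactly the clustering pattern you hope to see for your groups of interest.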
A Differential Expression (DE) analysis allows you to test, for every single expressed gene in your dataset, whether it significantly differs in expression level between groups.
Typically, this is done with pairwise comparisons between groups:
Gene count normalization
To be able to fairly compare samples, raw gene counts need to be adjusted:
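DESeq2’s “median of ratios” approach can be sketched in a few lines of NumPy: compute a per-gene reference (the geometric mean across samples), then take each sample’s median ratio to that reference as its size factor. The count matrix is made up, and genes with zero counts (excluded in practice) are omitted:

```python
import numpy as np

# Sketch of DESeq2-style "median of ratios" normalization, which corrects
# for library-size differences between samples. Counts are hypothetical;
# genes with zeros would be excluded from the size-factor calculation.
counts = np.array([
    [100, 200],   # gene 1: sample 2 sequenced ~2x deeper
    [ 50, 100],   # gene 2
    [ 10,  20],   # gene 3
], dtype=float)

# Per-gene reference: geometric mean across samples (in log space)
log_geo_mean = np.mean(np.log(counts), axis=1)
# Size factor per sample: median ratio of its counts to the reference
log_ratios = np.log(counts) - log_geo_mean[:, None]
size_factors = np.exp(np.median(log_ratios, axis=0))
normalized = counts / size_factors
print(size_factors)   # sample 2 gets a ~2x larger size factor
```

After dividing by the size factors, the two samples’ counts become directly comparable.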
Probability distribution of the count data
Multiple-testing correction
10,000+ genes are independently tested during a DE analysis, so there is a dire need for multiple testing correction.
The standard method is the Benjamini-Hochberg (BH) method.
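The BH procedure itself is short enough to sketch: sort the p-values, scale each by (number of tests / rank), and enforce monotonicity from the largest p-value down. The p-values below are made up:

```python
# Sketch of the Benjamini-Hochberg procedure: the adjusted values can be
# interpreted as false discovery rates. The input p-values are hypothetical.

def bh_adjust(pvals):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        val = min(prev, pvals[i] * m / rank)
        adjusted[i] = val
        prev = val
    return adjusted

print(bh_adjust([0.001, 0.01, 0.03, 0.5]))
```

In practice DESeq2 reports BH-adjusted p-values (“padj”) for you; the sketch just shows what that column means.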
Log2-fold changes (LFC) as a measure of expression difference
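A log2-fold change compares mean normalized counts between two groups on a log scale, so that a doubling is +1 and a halving is −1. A minimal sketch with made-up values (a small pseudocount avoids taking the log of zero; DESeq2 uses more sophisticated shrinkage):

```python
import math

# Sketch: log2-fold change between the mean normalized counts of two
# groups. Values are hypothetical; the pseudocount avoids log(0).
def log2_fc(group1_counts, group2_counts, pseudo=0.5):
    m1 = sum(group1_counts) / len(group1_counts) + pseudo
    m2 = sum(group2_counts) / len(group2_counts) + pseudo
    return math.log2(m2 / m1)

# A gene whose expression doubles has an LFC of ~1
print(round(log2_fc([100, 100], [200, 200]), 2))
```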
R packages to the rescue
Specialized R/Bioconductor packages like DESeq2 and edgeR make differential expression analysis relatively straightforward and automatically take care of the abovementioned considerations (we will use DESeq2 in the lab).
Lists of DEGs can be quite long, and it is not always easy to make biological sense of them. Functional enrichment analyses help with this.
Functional enrichment analyses check whether certain functional categories of genes (e.g., biological processes, or pathways) are statistically overrepresented among up- and/or downregulated genes.
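The overrepresentation test behind many enrichment tools is a hypergeometric test: given N genes total, of which K belong to a category, how surprising is it to see k category genes among n DEGs? A sketch with made-up numbers:

```python
from math import comb

# Sketch of an overrepresentation test for one functional category:
# a hypergeometric test asking whether the DEGs contain more genes from
# the category than expected by chance. All numbers are hypothetical.
def hypergeom_pval(N, K, n, k):
    """P(X >= k) when drawing n genes from N, of which K are in the category."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# 10,000 genes total; 100 in a pathway; 500 DEGs, 20 of them in the pathway.
# The expected overlap by chance is ~5, so 20 is a strong overrepresentation.
p = hypergeom_pval(N=10_000, K=100, n=500, k=20)
print(f"{p:.2e}")
```

Enrichment tools run a test like this for every GO term or KEGG pathway, followed by multiple-testing correction across categories.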
There are a number of databases that group genes into functional categories, but the two main ones used for enrichment analysis are:
Gene Ontology (GO)
Kyoto Encyclopedia of Genes and Genomes (KEGG)
Genes are assigned zero, one, or more GO “terms”
Hierarchical structure, with more specific terms grouping into more general terms
The highest-level groupings are the three “ontologies”: Biological Process, Molecular Function, Cellular Component
Fig. 4 from Garrigos et al. 2023
Focuses on pathways for cellular and organismal functions, whose genes can be drawn and connected in maps
Whereas in well-annotated genomes the majority of genes have one or more GO terms associated with them, far fewer genes are usually annotated with KEGG orthologs